Experience Replay Optimization (ERO)

1 Overview

Experience Replay Optimization (ERO) masks stored transitions in order to improve sample efficiency. An additional neural network, the “replay policy”, takes features extracted from each transition and infers the probability of keeping (masking in) that transition. The agent then samples uniformly from the masked-in transitions.
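
A minimal sketch of this flow is shown below, with a linear layer plus sigmoid standing in for the replay-policy network; the feature layout, shapes, and constants are illustrative assumptions, not cpprb API.

```python
# Sketch: per-transition features -> mask probabilities -> Bernoulli mask
# -> uniform sampling over the masked-in transitions.
import numpy as np

rng = np.random.default_rng(0)

n_stored, feature_dim, batch_size = 1000, 3, 32

# Hypothetical per-transition features (e.g. TD error, reward, timestep).
features = rng.normal(size=(n_stored, feature_dim))

# Stand-in "replay policy": a linear layer + sigmoid instead of a full NN.
w, b = rng.normal(size=feature_dim), 0.0
phi = 1.0 / (1.0 + np.exp(-(features @ w + b)))   # mask probabilities in (0, 1)

# Binary mask I ~ Bernoulli(phi); keep only the masked-in transitions.
mask = rng.random(n_stored) < phi
kept = np.flatnonzero(mask)

# The agent samples its mini-batch uniformly from the masked-in subset.
batch_idx = rng.choice(kept, size=batch_size, replace=True)
```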

In order to train the replay policy, the binary masks (\( \mathbf{I} \)) drawn from a Bernoulli distribution are treated as its actions. The replay reward (\( r^{r} \)) is defined as the difference between the cumulative reward of the agent policy (\( \pi \)) and that of the previous policy (\( \pi^{\prime} \)): \( r^{r} = r^{c}_{\pi} - r^{c}_{\pi^{\prime}} \).
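
As a small worked example of the sign convention (the numbers and the replay_reward helper are illustrative, not from the paper):

```python
# Illustrative helper: replay reward as the improvement of the current
# agent policy pi over the previous policy pi'.
def replay_reward(cum_reward_pi, cum_reward_pi_prev):
    # r^r = r^c_pi - r^c_{pi'}
    return cum_reward_pi - cum_reward_pi_prev

r_r = replay_reward(120.0, 95.0)   # 25.0 > 0: the current masking gets reinforced
```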

The policy gradient for a mini-batch can be written as follows:

\[ \sum _{j:B_j \in B^{\text{batch}}} r^r \nabla [ \mathbf{I}_j \log \phi + (1-\mathbf{I}_j) \log (1-\phi) ] \]
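
Below is a NumPy sketch of one ascent step along this gradient, continuing the linear-sigmoid stand-in from the overview snippet (it reuses w, b, features, phi, mask, rng, and the replay reward r_r from the snippets above). It assumes the mini-batch for the replay-policy update is drawn from the whole buffer so that both mask outcomes can appear; the learning rate is an illustrative value.

```python
lr = 1e-3            # illustrative learning rate
batch_size = 32

upd_idx = rng.choice(features.shape[0], size=batch_size, replace=False)
I = mask[upd_idx].astype(np.float64)   # Bernoulli actions I_j
p = phi[upd_idx]                       # mask probabilities phi_j
x = features[upd_idx]

# With phi = sigmoid(z), the bracketed term's gradient w.r.t. the logit z is
# (I - phi); the chain rule then gives the gradient w.r.t. w and b.
grad_w = ((I - p)[:, None] * x).sum(axis=0)
grad_b = (I - p).sum()

# REINFORCE-style ascent step, weighted by the replay reward r^r.
w = w + lr * r_r * grad_w
b = b + lr * r_r * grad_b
```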

2 With cpprb

We plan to implement something like BernoulliMaskedReplayBuffer to support ERO. Even with such a future enhancement, users will still need to implement the neural network that infers the probabilities of the Bernoulli masks.
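
Until such a buffer exists, one way to emulate ERO-style sampling with the current cpprb API is to draw the Bernoulli mask outside the buffer and index into the result of get_all_transitions(). The environment shapes, the replay_policy placeholder, and the constants below are illustrative assumptions, not a definitive implementation.

```python
import numpy as np
from cpprb import ReplayBuffer

rng = np.random.default_rng(0)

rb = ReplayBuffer(int(1e4),
                  env_dict={"obs": {"shape": 4},
                            "act": {"shape": 1},
                            "rew": {},
                            "next_obs": {"shape": 4},
                            "done": {}})

# Fill the buffer from environment interaction (dummy data here).
for _ in range(100):
    rb.add(obs=rng.normal(size=4), act=rng.integers(2), rew=0.0,
           next_obs=rng.normal(size=4), done=0.0)

def replay_policy(transitions):
    """Placeholder for the user's network: one mask probability per transition."""
    return np.full(transitions["rew"].shape[0], 0.5)

all_tr = rb.get_all_transitions()
phi = replay_policy(all_tr)
mask = rng.random(phi.shape[0]) < phi            # I ~ Bernoulli(phi)
kept = np.flatnonzero(mask)

# Uniform mini-batch over the masked-in transitions only.
idx = rng.choice(kept, size=32, replace=True)
batch = {k: v[idx] for k, v in all_tr.items()}
```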

3 References
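
D. Zha, K.-H. Lai, K. Zhou, and X. Hu, “Experience Replay Optimization”, IJCAI (2019)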